{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Lesson 2\n",
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Some Glimpses of Financial Data Wrangling "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "![](Lesson2GoalHeaderImage.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##   2.1   Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The \"raw\" financial datasets based on which crucial investment decisions are taken, cannot be used on an as-is-where-is basis, despite the fact that these are dynamically generated on-line as the trading proceeds at the stock exchanges concerned. The financial datasets need to be cleaned and transformed to enable smooth functioning of the models that work over these datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Data wrangling** or **Data munging** or the more familiar **Data pre-processing**, deals with a slew of techniques that serve to clean the data, transform the data  and render it fit for use by various models that will work over it, to undertake operations or analysis or research. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With regard to \"raw\" financial datasets, the following problems may be encountered:  \n",
    "\n",
    "(1) blank tuples (rows) in the dataset, and   \n",
    "(2) missing values of certain fields in the dataset  \n",
    "\n",
    "Fig. 2.1 illustrates a snapshot of a financial dataset extracted from S&P BSE200 (Bombay Stock Exchange, India) for the period Feb 17, 2002 to Feb 28, 2002, that has these aformentioned problems. The empty rows, columns and missing values,  can be seen in the \"raw\" dataset."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](Lesson2Fig2_1.png)\n",
    "#### Fig. 2.1 Snapshot of a finance dataset that calls for data wrangling\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.2      Eliminating empty rows"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Considering the size of the finance datasets,  which needless to say can be very huge, the removal of empty rows from the dataset can be accomplished by coding a function for the same. The Python function **EmptyRowsElimination**,  reads the stock prices from the dataset to eliminate the empty rows in the dataset using **dropna** method of Pandas. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "#function to eliminate empty rows in a dataset\n",
    "def EmptyRowsElimination(dfAssetPrices):\n",
    "\n",
    "    # read dataset and extract its dimensions\n",
    "    [Rows, Columns] = dfAssetPrices.shape\n",
    "    dFrame = dfAssetPrices.iloc[0:Rows, 0:Columns]\n",
    "    \n",
    "    # call dropna method from Pandas \n",
    "    dFClean = dFrame.dropna(axis =0, how ='all')\n",
    "    return dFClean"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Python code invoking the function **EmptyRowsElimination** over the example dataset shown in Fig. 2.1 is given below. The output shows that the  two empty rows corresponding to dates, 23-02-2002 and 24-02-2002 have been removed. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['RELIANCE INDUSTRIES LIMITED', 'INFOSYS TECHNOLOGIES LTD', 'ITC LTD', 'BHARTI AIRTEL LIMITED', 'HOUSING DEVELOPMENT FINANCE', 'LARSEN & TOUBRO LIMITED', 'ICICI BANK LTD', 'HDFC BANK LIMITED', 'OIL & NATURAL GAS CORP LTD', 'STATE BANK OF INDIA', 'HINDUSTAN UNILEVER LIMITED', 'BHARAT HEAVY ELECTRICALS', 'NTPC LIMITED', 'TATA CONSULTANCY SVS LTD', 'GAIL INDIA LTD', 'CIPLA LTD', 'GRASIM INDUSTRIES LTD', 'TATA POWER CO LTD']\n",
      "\n",
      "Data cleaning completed!\n",
      "Dimensions of the cleaned dataset (10, 18)\n",
      "Cleaned dataset: \n",
      "     RELIANCE INDUSTRIES LIMITED  INFOSYS TECHNOLOGIES LTD  ITC LTD  \\\n",
      "0                           NaN                       NaN      NaN   \n",
      "1                        245.23                    466.28    49.84   \n",
      "2                        246.77                    462.74    50.05   \n",
      "3                        244.11                    456.34    48.49   \n",
      "4                        241.34                    457.45    49.16   \n",
      "5                        242.15                    464.64    49.97   \n",
      "8                        242.11                    468.44    48.42   \n",
      "9                        247.50                    480.57    48.33   \n",
      "10                       249.80                    479.23    48.70   \n",
      "11                       236.50                    441.28    49.77   \n",
      "\n",
      "    BHARTI AIRTEL LIMITED  HOUSING DEVELOPMENT FINANCE  \\\n",
      "0                   45.00                          NaN   \n",
      "1                   44.15                       338.60   \n",
      "2                   41.20                       337.35   \n",
      "3                   41.40                       334.82   \n",
      "4                   42.45                       330.13   \n",
      "5                   43.45                       335.00   \n",
      "8                   42.65                       336.48   \n",
      "9                   43.55                       340.82   \n",
      "10                  44.30                       333.52   \n",
      "11                  40.90                       340.90   \n",
      "\n",
      "    LARSEN & TOUBRO LIMITED  ICICI BANK LTD  HDFC BANK LIMITED  \\\n",
      "0                       NaN             NaN                NaN   \n",
      "1                     81.53          127.80             248.45   \n",
      "2                     79.56          125.25             234.85   \n",
      "3                     76.94          124.30             236.00   \n",
      "4                     76.57          127.50             236.30   \n",
      "5                     76.25          140.25             247.00   \n",
      "8                     76.27          140.25             243.80   \n",
      "9                     77.47          135.45             237.70   \n",
      "10                    78.12          131.55             235.60   \n",
      "11                    73.34          125.55             235.20   \n",
      "\n",
      "    OIL & NATURAL GAS CORP LTD  STATE BANK OF INDIA  \\\n",
      "0                          NaN                  NaN   \n",
      "1                       152.57               229.09   \n",
      "2                       150.27               226.30   \n",
      "3                       155.37               226.82   \n",
      "4                       152.57               230.41   \n",
      "5                       156.73               227.81   \n",
      "8                       155.37               228.10   \n",
      "9                       161.27               245.55   \n",
      "10                      157.13               238.81   \n",
      "11                      149.13               214.93   \n",
      "\n",
      "    HINDUSTAN UNILEVER LIMITED  BHARAT HEAVY ELECTRICALS  NTPC LIMITED  \\\n",
      "0                          NaN                       NaN           NaN   \n",
      "1                       241.35                     94.33           NaN   \n",
      "2                       241.00                     86.93           NaN   \n",
      "3                       239.95                     85.80           NaN   \n",
      "4                       240.65                     85.30           NaN   \n",
      "5                       240.50                     88.23           NaN   \n",
      "8                       239.10                     86.63           NaN   \n",
      "9                       241.55                     91.45           NaN   \n",
      "10                      246.50                     93.08           NaN   \n",
      "11                      249.80                     84.08           NaN   \n",
      "\n",
      "    TATA CONSULTANCY SVS LTD  GAIL INDIA LTD  CIPLA LTD  \\\n",
      "0                        NaN             NaN        NaN   \n",
      "1                        NaN           54.97      84.12   \n",
      "2                        NaN           53.63      84.32   \n",
      "3                        NaN           52.33      85.21   \n",
      "4                        NaN           52.03      85.22   \n",
      "5                        NaN           51.93      84.92   \n",
      "8                        NaN           49.20      85.30   \n",
      "9                        NaN           51.40      83.87   \n",
      "10                       NaN           49.37      84.10   \n",
      "11                       NaN           47.27      80.44   \n",
      "\n",
      "    GRASIM INDUSTRIES LTD  TATA POWER CO LTD  \n",
      "0                     NaN                NaN  \n",
      "1                  315.51             133.35  \n",
      "2                  311.01             130.30  \n",
      "3                  301.66             126.50  \n",
      "4                  299.76             126.80  \n",
      "5                  299.06             129.90  \n",
      "8                  298.71             130.20  \n",
      "9                  300.96             131.45  \n",
      "10                 307.61             130.80  \n",
      "11                 289.21             119.30  \n"
     ]
    }
   ],
   "source": [
    "#empty rows elimination from stock prices dataset\n",
    "\n",
    "#dependencies\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "#input dataset and dimensions of the dataset\n",
    "StockFileName = 'Lesson2FinDataWranglingSampledata.csv'\n",
    "Rows = 12      #excluding headers\n",
    "Columns = 18  #excluding date\n",
    "\n",
    "#read stock prices \n",
    "df = pd.read_csv(StockFileName,  nrows= Rows)\n",
    "\n",
    "#extract asset Names\n",
    "assetNames = df.columns[1:Columns+1].tolist()\n",
    "print(assetNames)\n",
    "\n",
    "#clean the stock dataset of empty rows\n",
    "StockData = df.iloc[0:, 1:]\n",
    "dfClean = EmptyRowsElimination(StockData)\n",
    "print('\\nData cleaning completed!')\n",
    "[rows, cols]=dfClean.shape\n",
    "print('Dimensions of the cleaned dataset', dfClean.shape)\n",
    "print('Cleaned dataset: \\n', dfClean)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It can be clearly seen that rows 6 and 7 which held empty rows have been eliminated. However, the two empty columns corresponding to stocks NTPC Ltd and Tata Consultancy Services Ltd., continue to persist.  To eliminate columns that are wholly empty one merely has to use **dropna** method with axis =1.  \n",
    "  \n",
    "However, this was not done for the given S&P BSE200 data set only  to draw attention to the fact that empty columns in a stock dataset during a  given period,  may be indicative of the fact that the stocks were not listed on the exchange during the period concerned. Therefore, models which work on such datasets may either choose to eliminate such columns (ignoring short series data as it is said),   if the time period is important for the study, or may choose to work over the dataset pulled for an appropriate time period,  when the stocks were listed on the exchange and traded, thereby leaving no columns empty, for reasons otherwise. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Summarizing, a finance dataset has to be essentially cleaned of all empty rows and columns, to enable successful working of models over it. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##   2.3 Filling Missing Values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many a time, datasets are seen interspersed with quite a few empty fields which can only be attributed to improper handling of data downstream or errors that propagated during data handling at source. Termed **missing values** or **missing data** in data analytics, this is a serious problem that cannot be overlooked and needs to be redressed. Fortunately,  a variety of methods are available to tackle missing data in a dataset. Different models working over these datasets adopt different techniques to tackle the missing values in the dataset, based on their justifications and requirements of analyses. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This work, considering the fact that the dataset describes the daily prices of stocks traded on a particular day, the missing values are replaced by the previous day's trading price which is logical and therefore acceptable to the models that will work on such datasets. Needless to say, filling missing values using this method,  insist on the following mandatory requirements:  \n",
    "\n",
    "(1) The dataset should have been cleared of all empty rows and empty columns (as \n",
    "    discussed in Sec. 2.2), and\n",
    "(2) The opening row of the dataset cannot have missing values.  \n",
    "\n",
    "\n",
    "A snapshot of an S&P BSE200 dataset (Bombay Stock Exchange, India) during August 16, 2001 to August 31, 2001 is shown in Fig.2.2. Observe that the dataset is interspersed with quite a few missing values. The data set however, has been cleaned to remove empty rows /columns for the period concerned and the important requirement of  the opening row not having any missing values has been ensured. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](Lesson2Fig2_2.png) \n",
    "####     Fig. 2.2  Snapshot of a finance dataset that calls for filling missing data values "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function **FillMissingValues** undertakes to fill the missing stock prices with the previous day's prices for a dataset satisfying the mandatory requirements."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "#function to fill missing values of daily stock prices\n",
    "#Mandatory requirements: (1) The dataset should have been cleaned of all empty rows \n",
    "#before missing values are filled, and \n",
    "#(2) the opening row of the dataset should not have any empty fields\n",
    "\n",
    "def FillMissingValues(StockPrices):\n",
    "    \n",
    "    import numpy as np\n",
    "    print('Fill missing values...')\n",
    "    \n",
    "    #identify positions of the missing values in StockPrices\n",
    "    [rows, cols] = np.where(np.asarray(np.isnan(StockPrices)))\n",
    "    \n",
    "    #replace missing value with the previous day's price\n",
    "    for t in range(rows.size):\n",
    "        i=rows[t]\n",
    "        j = cols[t]\n",
    "        if (i-1) >= 0:           \n",
    "            StockPrices.iloc[i,j]= StockPrices.iloc[i-1, j].copy()\n",
    "        else:\n",
    "            print('error')\n",
    "    return StockPrices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Python code to undertake missing values filling for the dataset shown in Fig. 2.2 is shown below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Asset Labels: ['JINDAL SAW LTD', 'BAJAJ HINDUSTHAN LIMITED', 'LAKSHMI MACHINE WORKS LTD', 'GUJARAT MINERAL DEV CORP LTD', 'MOSER BAER INDIA LTD']\n",
      "Fill missing values...\n",
      "Filling missing values completed!\n",
      "\n",
      "    JINDAL SAW LTD  BAJAJ HINDUSTHAN LIMITED  LAKSHMI MACHINE WORKS LTD  \\\n",
      "0            60.85                      5.10                      69.00   \n",
      "1            60.70                      5.10                      70.00   \n",
      "2            59.50                      5.10                      70.00   \n",
      "3            58.80                      5.06                      69.50   \n",
      "4            57.25                      4.92                      66.49   \n",
      "5            57.05                      5.25                      69.90   \n",
      "6            62.35                      5.25                      70.00   \n",
      "7            66.55                      4.86                      70.00   \n",
      "8            68.50                      4.86                      70.50   \n",
      "9            69.65                      4.86                      70.50   \n",
      "10           65.75                      4.86                      70.00   \n",
      "\n",
      "    GUJARAT MINERAL DEV CORP LTD  MOSER BAER INDIA LTD  \n",
      "0                           4.80                 86.08  \n",
      "1                           4.75                 86.55  \n",
      "2                           4.68                 86.87  \n",
      "3                           4.73                 86.67  \n",
      "4                           4.83                 86.65  \n",
      "5                           5.07                 84.27  \n",
      "6                           5.01                 82.45  \n",
      "7                           4.95                 82.92  \n",
      "8                           4.96                 83.62  \n",
      "9                           4.95                 84.22  \n",
      "10                          4.94                 83.20  \n"
     ]
    }
   ],
   "source": [
    "#filling missing values of stock prices dataset\n",
    "\n",
    "#dependencies\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "#input dataset and the dimensions of the cleaned dataset\n",
    "StockFileName = 'Lesson2MissingValBSE200.csv'\n",
    "Rows = 11  #excluding header\n",
    "Columns = 5  #excluding date\n",
    "\n",
    "#read stock prices from the dataset as a pandas dataframe\n",
    "df = pd.read_csv(StockFileName,  nrows= Rows)\n",
    "StockData = df.iloc[0:, 1:]\n",
    "\n",
    "#extract asset labels\n",
    "assetLabels = df.columns[1:Columns+1].tolist()\n",
    "print('Asset Labels:',assetLabels)\n",
    "\n",
    "#impute missing data with previous day's trading price\n",
    "stockDataClean = FillMissingValues(StockData)\n",
    "print('Filling missing values completed!\\n')\n",
    "print(stockDataClean)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two functions **EmptyRowsElimination** and **FillMissingValues** have been independently demonstrated over two live datasets only for purposes of illustration. In practice, needless to say, every stock price dataset will have to undergo the two cleaning operations in a sequence, to render them fit for models that will work on critical investment decision making using these historical datasets. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Suggested further reading"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are innumerable books and blogs that have been written on Data Wrangling or more familiarly Data Pre-processing. It is therefore left to the reader's choice to select those that cater to their individual needs.   \n",
    "\n",
    "**scikit-learn** is a Python Library built on **NumPy**, **SciPy** and **matplotlib**,  and comprises a  repertoire of tools to work on **machine learning**. The package **sklearn.preprocessing** in scikit-learn, provides numerous utility functions and transformers to help undertake data preprocessing. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "####  Next .....Lesson 3:  Heuristic Portfolio Selection\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](Lesson2ExitTailImage.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
